perm filename CHAP6[4,KMC]6 blob
sn#045453 filedate 1973-05-29 generic text, type T, neo UTF8
00100 .SEC MODEL VALIDATION
00200 (In collaboration with Franklin Dennis Hilf)
00300
00400
00500
00600 There are several meanings to the term "validate" which
00700 derive from the Latin VALIDUS = strong. Thus to validate X means to
00800 strengthen it. In science it usually means to strengthen X's
00900 acceptability as a hypothesis, theory, or model. Lurking in the
01000 background there is usually some concept of truth or authenticity.
01100 In a purely instrumentalist view theories are simply
01200 calculating or predicting devices for human convenience. They do not
01300 explain and it is unjustified to apply the terms of truth or falsity
01400 to them. Under a realist view one seeks explanatory truth,
01500 that which really is the case, and hence proposed theories must be
01600 evaluated for their authenticity. Since absolute truth cannot be attained,
01700 we must settle for degrees of approximation.
01800 To validate, then, is to carry out procedures
01900 which show to what degree X, or its consequences, correspond with
02000 facts of observation. We compare the model with its natural counterpart.
02100 The failures should be constructive, yielding new information. Discrepancies
02200 in the comparison reveal what is not understood and must be modified in the model. After modifications
02300 are made, a fresh comparison is made with the natural counterpart and
02400 we repeatedly cycle through this procedure attempting to gain convergence.
02500
02600 Once a simulation model reaches a stage of intuitive
02700 adequacy, a model builder should consider using more stringent
02800 evaluation procedures relevant to the model's purposes. For example,
02900 if the model is to serve as a training device, then a simple
03000 evaluation of its pedagogic effectiveness would be sufficient. But
03100 when the model is proposed as an explanation of a psychological
03200 process, more is demanded of the evaluation procedure. In the area of
03300 simulation models Turing's test has often been suggested as a validation procedure.
03400 It is very easy to become confused about Turing's Test. In
03500 part this is due to Turing himself who introduced the now-famous
03600 imitation game in a paper entitled COMPUTING MACHINERY AND
03700 INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
03800 there are actually two imitation games, the second of which is
03900 commonly called Turing's test.
04000 In the first imitation game two groups of judges try to
04100 determine which of two interviewees is a woman. Communication between
04200 judge and interviewee is by teletype. Each judge is initially
04300 informed that one of the interviewees is a woman and one a man who
04400 will pretend to be a woman. After the interview, the judge is asked
04500 what we shall call the woman-question, i.e. which interviewee was the
04600 woman? Turing does not say what else the judge is told but one
04700 assumes the judge is NOT told that a computer is involved nor is he
04800 asked to determine which interviewee is human and which is the
04900 computer. Thus, the first group of judges would interview two
05000 interviewees: a woman, and a man pretending to be a woman.
05100 The second group of judges would be given the same initial
05200 instructions, but unbeknownst to them, the two interviewees would be
05300 a woman and a computer programmed to imitate a woman. Both groups
05400 of judges play this game until sufficient statistical data are
05500 collected to show how often the right identification is made. The
05600 crucial question then is: do the judges decide wrongly AS OFTEN when
05700 the game is played with man and woman as when it is played with a
05800 computer substituted for the man? If so, then the program is
05900 considered to have succeeded in imitating a woman as well as a man
06000 imitating a woman. For emphasis we repeat: in asking the
06100 woman-question in this game, judges are not required to identify
06200 which interviewee is human and which is machine.
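
The criterion of the first game (do judges decide wrongly as often with the computer as with the man?) amounts to comparing two error rates. A minimal sketch of one way to make that comparison follows. Turing's paper specifies no statistical procedure, so the pooled two-proportion z statistic used here is an assumption, as are the function name and the counts, which are hypothetical and serve only as illustration.

```python
import math

def wrong_rate_comparison(wrong_man, n_man, wrong_machine, n_machine):
    """Compare how often judges err in the man-woman and computer-woman games.

    Returns both error rates and a pooled two-proportion z statistic.
    """
    p_man = wrong_man / n_man
    p_machine = wrong_machine / n_machine
    pooled = (wrong_man + wrong_machine) / (n_man + n_machine)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_man + 1 / n_machine))
    z = (p_machine - p_man) / se if se > 0 else 0.0
    return p_man, p_machine, z

# Hypothetical counts, for illustration only: 100 judgments in each game.
p_man, p_machine, z = wrong_rate_comparison(30, 100, 27, 100)
print(f"wrong with man: {p_man:.2f}, wrong with computer: {p_machine:.2f}, z = {z:.2f}")
```

Under this sketch, a z value small in magnitude (for example, below 1.96) would mean the two error rates cannot be told apart at the conventional 5% level.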
06300 Later on in his paper Turing proposes a variation of the
06400 first game. In the second game, one interviewee is a man and one is a
06500 computer. The judge is asked to determine which is man and which is
06600 machine, which we shall call the machine-question. It is this version
06700 of the game which is commonly thought of as Turing's test. It has
06800 often been suggested as a means of validating computer simulations of
06900 psychological processes.
07000 In the course of testing a simulation (PARRY) of paranoid
07100 linguistic behavior in a psychiatric interview, we conducted a number
07200 of Turing-like indistinguishability tests (Colby, Hilf, Weber and
07300 Kraemer, 1972). We say `Turing-like' because none of them consisted of
07400 playing the two games described above. We chose not to play these
07500 games for a number of reasons which can be summarized by saying that
07600 they do not meet modern criteria for good experimental design. In
07700 designing our tests we were primarily interested in learning more
07800 about developing the model. We did not believe the simple
07900 machine-question to be a useful one in serving the purpose of
08000 progressively increasing the credibility of the model but we
08100 investigated a variation of it to satisfy the curiosity of colleagues
08200 in artificial intelligence.
08300 METHOD
08400 The experimental arrangement of this indistinguishability test involved
08500 the technique of machine-mediated interviewing {6}. In this type of
08600 interview, the participants communicate by means of teletypes connected
08700 through a computer which sends "mail" back and forth between the two
08800 teletype jobs. The sender of a message types it using his own words
08900 in natural language. The message is accumulated in a buffer and shortly
09000 thereafter typed out on the receiver's teletype in a rapid, regular,
09100 machine-like fashion. Thus the technique eliminates para- and extralinguistic
09200 features found in the usual vis-a-vis interviews and teletyped interviews
09300 where the participants communicate directly.
09400
09500 In a run of the test, using this technique, a judge interviewed two
09600 patients, one after the other. In half the runs the first interview
09700 was with a human patient and in half the first was with the paranoid
09800 model. Two versions {weak and strong} of the model were utilized.
09900 The strong version is more severely paranoid and exhibits a delusional
10000 system while the weak version is less severely paranoid, showing
10100 suspiciousness but lacking systematized delusions. When the "patient"
10200 was the paranoid model, one of the authors {SW} served as a monitor to
10300 check the input expressions from the judge for inadmissible teletype
10400 characters and misspellings. If these were found, the monitor retyped the
10500 input expression correctly to the program. Otherwise the judge's
10600 message was sent on to the model. The monitor had no effect on the
10700 model's output expressions which were sent directly back to the judge.
10800 When the patient interviewed was an actual human patient, the dialogue
10900 took place without a monitor in the loop since we did not feel the
11000 asymmetry to be significant.
11100
11200 PATIENTS
11300
11400 The patients {N= 3 with one patient participating 6 times} were
11500 diagnosed as paranoid by staff psychiatrists of a locked-ward in a
11600 nearby psychiatric hospital. The patients were selected by the head
11700 of the ward. Two patients were set up for each run of the experiment
11800 in order to guarantee having a subject. In spite of this precaution,
11900 the experiment could not be conducted several times because of the
12000 patient's inability or refusal to participate. Losses were also
12100 suffered when the computer system broke down at an early point in an
12200 interview where too few I-O pairs had been collected to be included
12300 in the statistical results.
12400
12500 The patients were asked by their ward-chief if they would be willing
12600 to participate in a study of psychiatric interviewing by means of
12700 teletypes. It was explained that the patient would be interviewed
12800 by a psychiatrist over a teletype. One of us {KMC} sat with the
12900 patient while he typed or typed for him if he was unable to do so.
13000 The patient was encouraged to respond freely using his own words.
13100 Each interview lasted 30-40 minutes.
13200
13300 JUDGES
13400
13500 Two groups of judges were used. One group, the interview judges
13600 {N = 8} conducted interviews and another group, the protocol judges
13700 for this test {N = 33} read the interview protocols. Two groups
13800 of judges were used to see if the small number of psychiatrists
13900 used as interview judges were representative of psychiatrists in
14000 general as far as their judgements of "paranoia" are concerned,
14100 and to accumulate a large number of observations {in the form of
14200 ratings} in order that more acceptable confidence levels might be
14300 obtained in the statistical analysis of the data. The interview
14400 judges consisted of psychiatrists experienced in private and/or
14500 hospital practice. As mentioned, the concept PARANOID is a fairly
14600 reliable category and identification of the paranoid mode is not
14700 difficult for experts to make. The interview judges were selected
14800 from local psychiatric colleagues willing to participate. Each
14900 interview judge was told he would be interviewing hospitalized
15000 patients by means of teletyped communication and that this technique
15100 was being used to eliminate para- and extralinguistic cues. The
15200 interview judge was not informed initially that one of the patients
15300 might be a computer model. While the interview judges were aware that
15400 a computer was involved, none knew that we had constructed a paranoid
15500 simulation. Naturally some interview judges suspected that a computer was
15600 being used for more than message transmission.
15700
15800 Each interview judge's task was to rate the degree of paranoia he
15900 detected in the patient's responses on a 0-9 scale, 0 meaning no
16000 paranoia and 9 meaning extreme paranoia. The judge made two ratings
16100 after each I-O pair in the interview. The first rating represented
16200 his estimate of the degree of "paranoidness" in a particular response
16300 {designated as "Response" in the interview extracts below}. The second
16400 rating represented the judge's global estimate of the overall degree
16500 of "paranoidness" of the patient resulting from the totality of the
16600 patient's responses in the interview up to this point. The interview
16700 judge's ratings were entered on the teletype and saved on a disk file
16800 along with the interview. One of us {FDH} sat with the interviewing
16900 psychiatrist during both interviews. Each interview judge was asked
17000 not only to rate the patient's response but to give his reasons for
17100 these ratings. His reasons and other comments were tape-recorded.
17200
17300 We shall present a few excerpts from the interviews to indicate the
17400 nature of the data collected. The I-O pairs of the interview appear
17500 on the left column along with the judge's ratings. The judge's
17600 tape-recorded comments appear on the right.
17700 Excerpt Nr. 1.
17800 The following interview is with the strong version of the paranoid model.
17900
18000 I-O Pairs                           Comments by Judge Nr. 1
18001 {1} Dr. - HOW DID YOU COME TO BE
18002           IN THE HOSPITAL?
18003     Pt. - I AM UPSET                I associate being upset and agitated
18004                                     more with paranoia going on in the
18005                                     present.
18006     {RESPONSE 1 PATIENT 5}
18007 {2} Dr. - HOW LONG HAVE YOU BEEN
18008           IN THE HOSPITAL?
18009     Pt. - ABOUT A WEEK              An acute problem. Again something I
18010                                     associate with paranoia.
18011     {RESPONSE 1 PATIENT 5}
18012 {3} Dr. - IN WHAT WAY ARE YOU
18013           UPSET?
18014     Pt. - PEOPLE MAKE ME NERVOUS    Right for paranoia. Maybe an advanced
18015                                     anxiety state.
18016     {RESPONSE 5 PATIENT 5}
18017 {4} Dr. - IN WHAT WAY DO THEY
18018           MAKE YOU NERVOUS?
18019     Pt. - PEOPLE GIVE ME A FUNNY    Funny look suggests an idea of reference
18020           LOOK SOMETIMES            which is in same ball park as paranoia.
18021                                     Whether there are overt delusions or not
18022                                     we still don't know.
18023     {RESPONSE 6 PATIENT 5}
18024 {5} Dr. - IS THERE ANYTHING ABOUT
18025           PEOPLE IN PARTICULAR?
18026     Pt. - CERTAIN PERSONS           An interesting clue. We'll have to
18027                                     find out what he thinks about these
18028     {RESPONSE 2 PATIENT 6}          people.
18029 {6} Dr. - WHO ARE THE CERTAIN
18030           PERSONS AND WHAT IS IT
18031           THAT THEY DO?
18032     Pt. - DO YOU KNOW HOW THE       Bringing up the underworld suggests
18033           UNDERWORLD OPERATES?      a delusional system. We don't know
18034                                     w
18100
18200 In this design eight psychiatrists each interviewed two patients
18300 by teletype, using the technique of machine-mediated interviewing
18400 which involves what we term "non-nonverbal" communication since
18500 non-verbal cues are made impossible (Hilf, 1972). Each judge
18600 interviewed two patients, one being PARRY and one being a hospitalized
18700 paranoid patient. The interviewers were not informed that a
18800 simulation was involved nor were they asked to identify which was the
18900 machine. Their task was to conduct a diagnostic psychiatric interview
19000 and rate each response from the `patients' along a 0-9 scale of
19100 paranoidness, 0 meaning zero and 9 being highest. Transcripts of
19200 these interviews, without the ratings of the interviewers, were then
19300 utilized for various experiments in which randomly selected expert
19400 judges conducted evaluations of the interview transcripts. For
19500 example, in one experiment it was found that patients and model were
19600 indistinguishable along the dimension of paranoidness.
19700 (Elaborate from ttt paper here giving interviews, data, tables etc.)
19800 To ask the machine-question, we sent interview transcripts,
19900 one with a patient and one with PARRY, to 100 psychiatrists randomly
20000 selected from the Directory of American Specialists and the Directory
20100 of the American Psychiatric Association. Of the 41 replies 21 (51%)
20200 made the correct identification while 20 (49%) were wrong. Based on
20300 this random sample of 41 psychiatrists, the 95% confidence interval
20400 for the rate of correct identification is 35.9% to 66.5%, a range
20500 which includes the chance level of 50%. (Our statistical consultant
20600 was Dr. Helena C. Kraemer, Research Associate in Biostatistics,
20700 Department of Psychiatry, Stanford University.)
20800 Psychiatrists are considered expert judges of patient
20900 interview behavior but they are unfamiliar with computers. Hence we
21000 conducted the same test with 100 computer scientists randomly
21100 selected from the membership list of the Association for Computing
21200 Machinery, ACM. Of the 67 replies 32 (48%) were right and 35 (52%)
21300 were wrong. Based on this random sample of 67 computer scientists the
21400 95% confidence interval ranges from 36% to 60%, again close to a chance level.
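
The two intervals reported above are consistent with the usual normal approximation for a proportion. The sketch below is an assumption in that respect: the text does not say which method was used, and the function name is illustrative. The counts are the ones given in the text.

```python
import math

def proportion_ci(correct, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Counts taken from the text: correct identifications out of replies received.
samples = [("psychiatrists", 21, 41), ("computer scientists", 32, 67)]
intervals = {}
for label, correct, n in samples:
    low, high = proportion_ci(correct, n)
    intervals[label] = (low, high)
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}%")
```

Both intervals contain 50%, which is what guessing between two transcripts would be expected to produce.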
21500 Thus the answer to this machine-question "can expert judges,
21600 psychiatrists and computer scientists, using teletyped transcripts
21700 of psychiatric interviews, distinguish between paranoid patients and
21800 a simulation of paranoid processes?" is "No". But what do we learn
21900 from this? It is some comfort that the answer was not "yes" and the
22000 null hypothesis (no differences) failed to be rejected, especially
22100 since statistical tests are somewhat biased in favor of rejecting the
22200 null hypothesis (Meehl, 1967). Yet this answer does not tell us what
22300 we would most like to know, i.e. how to improve the model.
22400 Simulation models do not spring forth in a complete, perfect and
22500 final form; they must be gradually developed over time. Perhaps we
22600 might obtain a "yes" answer to the machine-question if we allowed a
22700 large number of expert judges to conduct the interviews themselves
22800 rather than studying transcripts of other interviewers. Such an answer would
22900 indicate that the model must be improved but unless we systematically
23000 investigated how the judges succeeded in making the discrimination we
23100 would not know what aspects of the model to work on. The logistics of
23200 such a design are immense and obtaining a large N of judges for sound
23300 statistical inference would require an effort disproportionate to the
23400 information-yield.
23500 A more efficient and informative way to use Turing-like tests
23600 is to ask judges to make ordinal ratings along scaled dimensions from
23700 teletyped interviews. We shall term this approach asking the
23800 dimension-question. One can then compare scaled ratings received by
23900 the patients and by the model to precisely determine where and by how
24000 much they differ. Model builders strive for a model which
24100 shows indistinguishability along some dimensions and
24200 distinguishability along others. That is, the model converges on what
24300 it is supposed to simulate and diverges from that which it is not.
24400 We mailed paired-interview transcripts to another 400
24500 randomly selected psychiatrists asking them to rate the responses of
24600 the two `patients' along certain dimensions. The judges were divided
24700 into groups, each judge being asked to rate responses of each I-O
24800 pair in the interviews along four dimensions. The total number of
24900 dimensions in this test was twelve: linguistic noncomprehension,
25000 thought disorder, organic brain syndrome, bizarreness, anger, fear,
25100 ideas of reference, delusions, mistrust, depression, suspiciousness
25200 and mania. These are dimensions which psychiatrists commonly use in
25300 evaluating patients.
25400 Table 1 shows there were significant differences, with PARRY
25500 receiving higher scores along the dimensions of linguistic
25600 noncomprehension, thought disorder, bizarreness, anger, mistrust and
25700 suspiciousness. On the dimension of delusions the patients were rated
25800 significantly higher. There were no significant differences along the
25900 dimensions of organic brain syndrome, fear, ideas of reference,
26000 depression and mania.
26100 While tests asking the machine-question indicate
26200 indistinguishability at the gross level, a study of the finer
26300 structure of the model's behavior through ratings along scaled
26400 dimensions shows statistically significant differences between
26500 patients and model. These differences are of help to the model
26600 builder in suggesting which aspects of the model must be modified and
26700 improved in order to be considered an adequate simulation of the
26800 class of paranoid patients it is intended to simulate. For example,
26900 it is clear that PARRY's language-comprehension must be improved.
27000 Once this has been implemented, a future test will tell us whether
27100 improvement has occurred and by how much in comparison to the earlier
27200 version. Successive identification of particular areas of failure
27300 in the model permits their improvement and the development of
27400 more adequate model-versions.
27500 Further evidence that the machine-question is too coarse and
27600 insensitive a test comes from the following experiment. In this test
27700 we constructed a random version of the paranoid model which utilized
27800 PARRY's output statements but expressed them randomly no matter what
27900 the interviewer said. Two psychiatrists conducted interviews with
28000 this model, transcripts of which were paired with patient interviews
28100 and sent to 200 randomly selected psychiatrists asking both the
28200 machine-question and the dimension-question. Of the 69 replies, 34
28300 (49%) were right and 35 (51%) wrong. Based on this random sample of
28400 69 psychiatrists, the 95% confidence interval ranges from 39 to 63,
28500 again indicating a chance level. However as shown in Table 2
28600 significant differences appear along the dimensions of linguistic
28700 noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
28800 rated higher. On these particular dimensions we can construct a
28900 continuum in which the random version represents one extreme, the
29000 actual patients another. Our (nonrandom) PARRY lies somewhere between
29100 these two extremes, indicating that it performs significantly better
29200 than the random version but still requires improvement before being
29300 indistinguishable from patients (see Fig. 1). Table 3 presents t
29400 values for differences between mean ratings of PARRY and
29500 RANDOM-PARRY. (See Table 2 and Fig. 1 for the mean ratings.)
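
The random version described in this paragraph can be pictured as a responder that ignores its input altogether. The following is only a sketch of that idea and not the program actually used: the function name is hypothetical, and the output statements are the patient lines from the interview excerpt in this section, standing in for PARRY's full set of output statements.

```python
import random

# Output statements borrowed from the interview excerpt, as stand-ins only.
CANNED_OUTPUTS = [
    "I AM UPSET",
    "ABOUT A WEEK",
    "PEOPLE MAKE ME NERVOUS",
    "PEOPLE GIVE ME A FUNNY LOOK SOMETIMES",
    "CERTAIN PERSONS",
    "DO YOU KNOW HOW THE UNDERWORLD OPERATES?",
]

def random_reply(interviewer_input, rng):
    """Return a canned output chosen at random, whatever the interviewer said."""
    return rng.choice(CANNED_OUTPUTS)

rng = random.Random(0)  # fixed seed so that a run can be repeated
questions = ["HOW LONG HAVE YOU BEEN IN THE HOSPITAL?", "IN WHAT WAY ARE YOU UPSET?"]
replies = [random_reply(q, rng) for q in questions]
for q, r in zip(questions, replies):
    print("Dr. -", q)
    print("Pt. -", r)
```

Because the reply never depends on the question, any relevance between the two is accidental, which is why such a version would be expected to score high on linguistic noncomprehension.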
29600 Thus it can be seen that such a multidimensional analysis
29700 provides yardsticks for measuring the adequacy of this or any other
29800 dialogue simulation model along the relevant dimensions.
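
As an illustration of such a yardstick, a t value for the difference between two sets of mean ratings on one dimension might be computed as sketched below. The text does not say which form of the t statistic was used, so Welch's unequal-variance form is an assumption here, and the ratings are hypothetical numbers on the 0-9 scale, not data from the study.

```python
import math
from statistics import mean, variance

def welch_t(ratings_a, ratings_b):
    """Welch's t statistic for the difference between two mean ratings."""
    var_a, var_b = variance(ratings_a), variance(ratings_b)
    se = math.sqrt(var_a / len(ratings_a) + var_b / len(ratings_b))
    return (mean(ratings_a) - mean(ratings_b)) / se

# Hypothetical 0-9 ratings on a single dimension, for illustration only.
model_ratings = [5, 6, 4, 7, 5, 6]
patient_ratings = [3, 2, 4, 3, 2, 4]
t_value = welch_t(model_ratings, patient_ratings)
print(f"t = {t_value:.2f}")
```

Repeating the computation for each dimension, and again after each revision of the model, gives a per-dimension record of whether the model is moving toward or away from the patients.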
29900 We conclude that when model builders want to conduct tests
30000 of adequacy which indicate in which direction progress lies and which
30100 measure whether progress is being achieved, the way to use
30200 Turing-like tests is to ask expert judges to make ratings along
30300 multiple dimensions that are essential to the model. A good validation
30400 procedure has criteria for better or worse approximations. Useful tests do
30500 not prove a model; they probe it for its strengths and weaknesses and
30600 clarify what is to be done next in modifying and repairing the model.
30700 Simply asking the machine-question yields little information relevant
30800 to what the model builder most wants to know, namely, along what
30900 dimensions must the model be improved.
31000
31100